Outline

Framing the problem

  1. Supervised learning
    1. classification: categorical \(y_i\) is available for all \(x_i\)
    2. regression: continuous \(y_i\) is available for all \(x_i\)
  2. Unsupervised learning: \(y_i\) unavailable for all \(x_i\)

What type of problem is this? (1/3)

Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers. For the sake of staff morale, they usually want to avoid either the substance or the appearance of unfair treatment of the servers, for whom tips (at least in restaurants in the United States) are a major component of pay.

In one restaurant, a food server recorded the following data on all customers they served during an interval of two and a half months in early 1990. The restaurant, located in a suburban shopping mall, was part of a national chain and served a varied menu. In observance of local law the restaurant offered seating in a non-smoking section to patrons who requested it. Each record includes a day and time, and taken together, they show the server’s work schedule.

What is \(y\)? What is \(x\)?

What type of problem is this? (2/3)

Every person monitored their email for a week and recorded information about each email message; for example, whether it was spam, and what day of the week and time of day the email arrived. We want to use this information to build a spam filter, a classifier that will catch spam with high probability but will never classify good email as spam.

What is \(y\)? What is \(x\)?

What type of problem is this? (3/3)

A health insurance company collected the following information about households:

  • Total number of doctor visits per year
  • Total household size
  • Total number of hospital visits per year
  • Average age of household members
  • Total number of gym memberships
  • Use of physiotherapy and chiropractic services
  • Total number of optometrist visits

The health insurance company wants to provide a small range of products, containing different bundles of services and for different levels of cover, to market to customers.

What is \(y\)? What is \(x\)?

Model form

All (data-centric) models have fitted values and residuals.

\[y = f(x_1, x_2, ..., x_p) + \varepsilon\]

where \(y\) is the observed response, \(x_1, ..., x_p\) are the observed values of \(p\) predictors, and \(\varepsilon\) is the error. We conventionally use \(n\) for the sample size.

  • We might not be able to specify \(f\) exactly in some forms
  • Ideally the residuals are uniform over the data space and symmetric around the fitted values; some models also impose a distribution on them.
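As a concrete sketch of this form, we can simulate data from a known \(f\), fit a model, and extract the fitted values and residuals. The data and the choice of \(f\) here are made up for illustration:

```r
# Simulate y = f(x) + error for a known f (made-up example),
# then fit a model and pull out fitted values and residuals
library(tidyverse)
library(broom)

set.seed(1)
d <- tibble(x = runif(100)) |>
  mutate(y = 1 + 3 * x + rnorm(100, sd = 0.2))  # f(x) = 1 + 3x

fit <- lm(y ~ x, data = d)
augment(fit, d) |>
  select(x, y, .fitted, .resid) |>
  slice_head(n = 3)
```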

Accuracy vs interpretability

Predictive accuracy

The primary purpose is to be able to predict \(Y\) for new data, that is, to produce \(\widehat{Y}\). And we’d like to do that well! That is, accurately.

From XKCD

Interpretability

Almost equally important is that we want to understand the relationship between \({\mathbf X}\) and \(Y\). Between two models that are (almost) equally accurate, we always choose the simpler one.

Person: Why did you predict 42 for this value?

Computer: Awkward silence

From Interpretable Machine Learning

Parametric vs non-parametric

Parametric methods

  • Assume that the model takes a specific form
  • Fitting then is a matter of estimating the parameters of the model
  • Generally considered to be less flexible
  • If assumptions are wrong, performance likely to be poor

Non-parametric methods

  • No specific assumptions
  • Allow the data to specify the model form, without being too rough or wiggly
  • More flexible
  • Generally needs more observations, and not too many variables
  • Easier to over-fit

From XKCD

Black line is true boundary.


Grids (right) show boundaries for two different models.

Reducible vs irreducible error

If the model form is incorrect, part of the error (solid circles) arises from the wrong shape, and is thus reducible. Irreducible error remains even when we have the right model: the mistakes (solid circles) are random noise.

Flexible vs inflexible

Parametric models tend to be less flexible but non-parametric models can be flexible or less flexible depending on parameter settings.

Bias vs variance

Bias is the error introduced by approximating a complicated problem with a simpler model.

  • For example, linear regression assumes a linear relationship and perhaps the relationship is not exactly linear.
  • In general, but not always, the more flexible a method is, the less bias it will have because it can fit a complex shape better.

Variance refers to how much your estimate would change if you had different training data. It measures how much your model depends on the particular data you have, to the neglect of future data.

  • In general, the more flexible a method is, the more variance it has.
  • The size of the training set also affects the variance.
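The trade-off can be sketched by refitting an inflexible and a flexible model on two different samples from the same process. Everything here (the true \(f\), sample sizes, polynomial degree) is invented for illustration:

```r
# Sketch of the bias-variance trade-off on simulated data
library(tidyverse)

f <- function(x) sin(2 * pi * x)
sim <- function(n = 50) tibble(x = runif(n)) |>
  mutate(y = f(x) + rnorm(n, sd = 0.3))

set.seed(2)
d1 <- sim(); d2 <- sim()

# Inflexible (linear): high bias, low variance --
# the coefficients barely move between samples
coef(lm(y ~ x, data = d1)); coef(lm(y ~ x, data = d2))

# Flexible (degree-15 polynomial): low bias, higher variance --
# predictions at the same x can differ a lot between samples
p1 <- lm(y ~ poly(x, 15), data = d1)
p2 <- lm(y ~ poly(x, 15), data = d2)
predict(p1, tibble(x = 0.5)); predict(p2, tibble(x = 0.5))
```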

Bias

Bias is high when you impose too many assumptions with a parametric model, or use an inadequate non-parametric model, for example by not letting an algorithm converge fully.

Bias is low when the model closely captures the true shape, whether with a well-specified parametric model or a flexible model.

Variance

Variance is low when the fit would be virtually identical even with a different training sample.

Variance is high when a very different model is likely if a different training set is used.

Training vs test splits

When data are reused for multiple tasks, instead of carefully spent from the finite data budget, certain risks increase, such as the risk of accentuating bias or compounding effects from methodological errors. Julia Silge

  • Training set: used to fit the model; part of it may also be held out as a validation set for frequent assessment of fit.
  • Test set: used purely to assess the final model’s performance on future data.

Training vs test splits

d_bal <- tibble(y=c(rep("A", 6), rep("B", 6)),
                x=c(runif(12)))
d_bal$y
 [1] "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "B"
set.seed(130)
d_bal_split <- initial_split(d_bal, prop = 0.70)
training(d_bal_split)$y
[1] "A" "A" "B" "A" "B" "A" "B" "A"
testing(d_bal_split)$y
[1] "A" "B" "B" "B"
d_unb <- tibble(y=c(rep("A", 2), rep("B", 10)),
                x=c(runif(12)))
d_unb$y
 [1] "A" "A" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B"
set.seed(132)
d_unb_split <- initial_split(d_unb, prop = 0.70)
training(d_unb_split)$y
[1] "B" "B" "A" "B" "B" "A" "B" "B"
testing(d_unb_split)$y
[1] "B" "B" "B" "B"

Always stratify the split by sub-groups, especially by the classes of the response variable.

d_unb_strata <- initial_split(d_unb, prop = 0.70, strata=y)
training(d_unb_strata)$y
[1] "A" "B" "B" "B" "B" "B" "B" "B"
testing(d_unb_strata)$y
[1] "A" "B" "B" "B"

Measuring accuracy for categorical response

Compute \(\widehat{y}\) from the training data, \(\{(y_i, {\mathbf x}_i)\}_{i = 1}^n\). The error rate (fraction of misclassifications) on the same data gives the Training Error Rate:

\[\text{Error rate} = \frac{1}{n}\sum_{i=1}^n I(y_i \ne \widehat{y}({\mathbf x}_i))\]

A better estimate of future accuracy is obtained using test data to get the Test Error Rate.


Training error will usually be smaller than test error. When it is much smaller, the model is too closely fitted to the training data to be accurate on future data (over-fitted).
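The error-rate formula is simply the proportion of mismatched labels, which can be checked on a tiny made-up sample:

```r
# Error rate = average of I(y != yhat), here on a made-up sample
library(tibble)

d <- tibble(y    = c("A", "A", "A", "B", "B"),
            pred = c("A", "B", "A", "B", "B"))
mean(d$y != d$pred)  # one mismatch out of five = 0.2
```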

Confusion (misclassification) matrix

                predicted
                  1    0
true     1        a    b
         0        c    d

Consider 1=positive (P), 0=negative (N).

  • True positive (TP): a
  • True negative (TN): d
  • False positive (FP): c (Type I error)
  • False negative (FN): b (Type II error)
  • Sensitivity, recall, hit rate, or true positive rate (TPR): TP/P = a/(a+b)
  • Specificity, selectivity or true negative rate (TNR): TN/N = d/(c+d)
  • Prevalence: P/(P+N) = (a+b)/(a+b+c+d)
  • Accuracy: (TP+TN)/(P+N) = (a+d)/(a+b+c+d)
  • Balanced accuracy: (TPR + TNR)/2 (the average of the per-class accuracies)
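These definitions can be checked directly from the four cell counts; the counts below are hypothetical (note that assigning to `c` and `d` shadows the base functions, which is harmless in this small sketch because neither is called):

```r
# Hypothetical confusion-matrix counts
a <- 40; b <- 10   # true positives, false negatives
c <- 5;  d <- 45   # false positives, true negatives

TPR <- a / (a + b)                        # sensitivity = 0.8
TNR <- d / (c + d)                        # specificity = 0.9
prevalence   <- (a + b) / (a + b + c + d) # 0.5
accuracy     <- (a + d) / (a + b + c + d) # 0.85
bal_accuracy <- (TPR + TNR) / 2           # 0.85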

Confusion (misclassification) matrix: computing

Two classes

#| label: confusion-matrix-tidy
# Write out the confusion matrix in standard form
cm <- a2 |> count(y, pred) |>
  group_by(y) |>
  mutate(cl_acc = n[pred == y]/sum(n))
cm |>
  pivot_wider(names_from = pred, 
              values_from = n) |>
  select(y, bilby, quokka, cl_acc)
# A tibble: 2 × 4
# Groups:   y [2]
  y      bilby quokka cl_acc
  <fct>  <int>  <int>  <dbl>
1 bilby      9      3  0.75 
2 quokka     5     10  0.667
accuracy(a2, y, pred) |> pull(.estimate)
[1] 0.7
bal_accuracy(a2, y, pred) |> pull(.estimate)
[1] 0.71
sens(a2, y, pred) |> pull(.estimate)
[1] 0.75
specificity(a2, y, pred) |> pull(.estimate)
[1] 0.67

More than two classes

# Write out the confusion matrix in standard form
cm3 <- a3 |> count(y, pred) |>
  group_by(y) |>
  mutate(cl_acc = n[pred == y]/sum(n))
cm3 |>
  pivot_wider(names_from = pred, 
              values_from = n, values_fill = 0) |>
  select(y, bilby, quokka, numbat, cl_acc)
# A tibble: 3 × 5
# Groups:   y [3]
  y      bilby quokka numbat cl_acc
  <fct>  <int>  <int>  <int>  <dbl>
1 bilby      9      3      0  0.75 
2 numbat     0      2      6  0.75 
3 quokka     5     10      0  0.667
accuracy(a3, y, pred) |> pull(.estimate)
[1] 0.71
bal_accuracy(a3, y, pred) |> pull(.estimate)
[1] 0.78

Receiver Operating Characteristic (ROC) curves

The balance of getting it right, without predicting everything as positive.

From wikipedia

We need predicted probabilities: the probability of each observation belonging to each class.

a2 |> slice_head(n=3)
# A tibble: 3 × 4
  y     pred  bilby quokka
  <fct> <fct> <dbl>  <dbl>
1 bilby bilby   0.9    0.1
2 bilby bilby   0.8    0.2
3 bilby bilby   0.9    0.1
roc_curve(a2, y, bilby) |>
  autoplot()

Preparing to fit model: IDA



The first thing to do with data is to look at them …. usually means tabulating and plotting the data in many different ways to see what’s going on. With the wide availability of computer packages and graphics nowadays there is no excuse for ducking the labour of this preliminary phase, and it may save some red faces later.

Crowder, M. J. & Hand, D. J. (1990) “Analysis of Repeated Measures”

Pre-processing steps

  • Handle missing values: most modeling methods require complete data.
  • Decide on an appropriate model type: explore relationships between the response and predictors (nonlinear? clustered?). Different shapes indicate which model is likely to fit better.
  • Check model assumptions
  • Feature engineering: do you have the predictors needed, or could you create better predictors? Should some variables be transformed?
  • Check for anomalies: are there some observations that are unusual and might affect the fit? (We’ll do this at the end, too.)
  • Split data: Decide on how much of the data should be used for training a model, validation, and testing.
  • Reduce the number of predictors: Too many predictors can result in over-fitting the training data.

Check model assumptions

  • Statistical models tend to be less flexible. They impose strict assumptions, such as a linear form and Gaussian errors. Check by plotting the response against the predictors: is the relationship what is assumed, and is the spread uniform?
  • Non-parametric models and algorithms don’t explicitly impose assumptions, but the fitting process imposes constraints, which means the fit will be poor if they are not satisfied. For example, tree-based methods mostly use a single variable for any split. If the relationship with the response is best explained by a combination of variables, this results in a poor fit.

Feature engineering

Creating new variables to get better fits is a special skill! Sometimes automated by the method. All are transformations of the original variables. (See tidymodels steps.)

  • scaling, centering, sphering (step_center, step_scale, step_pca)
  • log or square root or Box-Cox transformation (step_log, step_BoxCox)
  • ratio of values (step_ratio)
  • polynomials or splines: \(x_1^2, x_1^3\) (step_poly, step_ns)
  • dummy variables: categorical predictors expanded into multiple new binary variables (step_dummy)
  • convolutional neural networks: neural networks with pre-processing of images that combines values of neighbouring pixels; flattening of images
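A sketch of a recipe chaining several of these steps; the data and variable names (income, spend, group) are invented for illustration:

```r
library(tidymodels)

set.seed(3)
d <- tibble(y      = rnorm(10),
            income = runif(10, 1, 10),
            spend  = runif(10, 1, 10),
            group  = factor(rep(c("a", "b"), 5)))

rec <- recipe(y ~ ., data = d) |>
  step_log(income) |>                               # transform
  step_ratio(spend, denom = denom_vars(income)) |>  # ratio of values
  step_dummy(all_nominal_predictors()) |>           # dummy variables
  step_normalize(all_numeric_predictors())          # centre and scale

prep(rec) |> bake(new_data = NULL)
```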

After fitting: diagnostics

Diagnosing the fit

Compute and examine the usual diagnostics; some methods have more:

  • fit statistics: accuracy, sensitivity, specificity
  • errors/misclassifications
  • variable importance
  • plot residuals, examine the misclassifications
  • check test set is similar to training

Go beyond … Look at the data and the model together!

Wickham et al (2015) Removing the Blindfold

Training - plusses; Test - dots

Model-in-the-data-space

From XKCD


We plot the model on the data to assess whether it fits or is a misfit!


Doing this in high dimensions is considered difficult!

So it is common to only plot the data-in-the-model-space.

NOTE, WE CAN PLOT THESE THINGS IN HIGH DIMENSIONS. But this is for another day.

Data-in-the-model-space

Predicted probabilities are aspects of the model, and are useful to plot. What do we learn here?

But it doesn’t tell you why there is a difference.

Model-in-the-data-space

The model is displayed as a grid of predicted points in the original variable space. The data are overlaid using text labels. What do you learn?

One model has a linear boundary, and the other has a highly non-linear boundary, which matches the class clusters better. Also …
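The general recipe for such a display can be sketched with simulated two-class data (all names and the generating process here are made up): predict over a fine grid of the predictor space, then overlay the data as labels.

```r
library(tidyverse)

set.seed(4)
d <- tibble(x1 = runif(200), x2 = runif(200)) |>
  mutate(class = factor(ifelse(x1 + x2 + rnorm(200, sd = 0.1) > 1,
                               "B", "A")))
fit <- glm(class ~ x1 + x2, data = d, family = binomial())

# Grid of predictions covering the data space
grid <- expand_grid(x1 = seq(0, 1, 0.02), x2 = seq(0, 1, 0.02))
grid$pred <- factor(ifelse(predict(fit, newdata = grid,
                                   type = "response") > 0.5, "B", "A"))

# Model (grid of faint points) in the data space, data as labels
ggplot() +
  geom_point(data = grid, aes(x1, x2, colour = pred),
             alpha = 0.1, size = 0.5) +
  geom_text(data = d, aes(x1, x2, label = class, colour = class)) +
  theme(aspect.ratio = 1, legend.position = "none")
```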

Plotting residuals

Code
# Subsample the surreal Jack-o'-Lantern data
d <- jackolantern_surreal_data[sample(1:5395, 2000), ]

ggduo(d, c("x1", "x2", "x3", "x4"), "y")

The data look uninteresting: only weak linear relationships between the response and predictors.

Code
# Fit a linear model to the surreal Jack-o'-Lantern data
d_lm <- lm(y ~ ., data = d)

# Plot the residuals to reveal the hidden image
d_all <- augment(d_lm, d)
ggplot(d_all, aes(x=.fitted, y=.resid)) +
  geom_point() +
  theme(aspect.ratio=1)

Residuals have a spooky pattern!

Reading residual plots using a lineup

  • Reading residual plots can be extremely hard!
  • You have to decide whether there is NO PATTERN.
  • This is a good place to use the lineup developed to do inference for data visualisation.
Code
library(ggplot2)
library(nullabor)  # provides lineup() and null_lm()
x <- lm(tip ~ total_bill, data = tips)
tips.reg <- data.frame(tips, .resid = residuals(x), 
                       .fitted = fitted(x))
ggplot(lineup(null_lm(tip ~ total_bill, 
                      method = 'rotate'), 
              tips.reg)) +
  geom_point(aes(x = total_bill, 
                 y = .resid)) +
  facet_wrap(~ .sample) +
  theme(axis.text = element_blank(),
        axis.title = element_blank())

Re-sampling statistics

Tidy models

Tidying model output

The package broom gets model results into a tidy format at different levels. (And broom.mixed does this for mixed models.)

  • model level: broom::glance (values such as AIC, BIC, model fit, …)
  • coefficients in the model: broom::tidy (estimate, confidence interval, significance level, …)
  • observation level: broom::augment (fitted values, residuals, predictions, influence, …)
Code
glance(d_lm)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>
1     0.220         0.218  1.00      93.8 5.52e-104     6
# ℹ 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#   deviance <dbl>, df.residual <int>, nobs <int>
Code
tidy(d_lm)
# A tibble: 7 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -0.0125    0.0224    -0.555 5.79e- 1
2 x1            1.03      0.126      8.17  5.59e-16
3 x2            0.460     0.268      1.72  8.58e- 2
4 x3            0.622     0.232      2.69  7.29e- 3
5 x4            0.965     0.820      1.18  2.39e- 1
6 x5            0.968     0.192      5.05  4.81e- 7
7 x6            0.930     0.117      7.94  3.40e-15

tidymodels

tidymodels is a set of R packages that make it easier to build, tune, and evaluate statistical and machine learning models — all using the same consistent, tidyverse-style syntax. It provides a framework for the modeling workflow in R:

  1. Preprocess data (with recipes)
  2. Specify models (with parsnip)
  3. Resample or cross-validate (with rsample)
  4. Tune hyperparameters (with tune)
  5. Evaluate and compare models (with yardstick, workflows, broom)

Case study: hotel bookings (1/10)

Source: A predictive modeling case study

About the data: we use the hotel bookings data from Antonio, Almeida, and Nunes (2019) to predict which hotel stays included children and/or babies, based on other characteristics of the stays, such as which hotel the guests stayed at, how much they paid, etc.

Model: We will build a model to predict which actual hotel stays included children and/or babies, and which did not. Our outcome variable children is a factor variable with two levels.

Code
library(tidymodels)
load("data/hotels.rda")
glimpse(hotels)
Rows: 50,000
Columns: 23
$ hotel                          <fct> City_Hotel, City_Ho…
$ lead_time                      <dbl> 217, 2, 95, 143, 13…
$ stays_in_weekend_nights        <dbl> 1, 0, 2, 2, 1, 2, 0…
$ stays_in_week_nights           <dbl> 3, 1, 5, 6, 4, 2, 2…
$ adults                         <dbl> 2, 2, 2, 2, 2, 2, 2…
$ children                       <fct> none, none, none, n…
$ meal                           <fct> BB, BB, BB, HB, HB,…
$ country                        <fct> DEU, PRT, GBR, ROU,…
$ market_segment                 <fct> Offline_TA/TO, Dire…
$ distribution_channel           <fct> TA/TO, Direct, TA/T…
$ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0…
$ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0…
$ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0…
$ reserved_room_type             <fct> A, D, A, A, F, A, C…
$ assigned_room_type             <fct> A, K, A, A, F, A, C…
$ booking_changes                <dbl> 0, 0, 2, 0, 0, 0, 0…
$ deposit_type                   <fct> No_Deposit, No_Depo…
$ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0…
$ customer_type                  <fct> Transient-Party, Tr…
$ average_daily_rate             <dbl> 81, 170, 8, 81, 158…
$ required_car_parking_spaces    <fct> none, none, none, n…
$ total_of_special_requests      <dbl> 1, 3, 2, 1, 4, 1, 1…
$ arrival_date                   <date> 2016-09-01, 2017-0…
Code
hotels |> 
  count(children) |>
  mutate(prop = n/sum(n))
# A tibble: 2 × 3
  children     n   prop
  <fct>    <int>  <dbl>
1 children  4038 0.0808
2 none     45962 0.919 

Case study: hotel bookings (2/10)

Data splitting and re-sampling

Reserve 25% of the stays for the test set. The outcome variable children is quite imbalanced, so use a stratified random sample.

Code
set.seed(656)
splits      <- initial_split(hotels, strata = children)

hotel_other <- training(splits)
hotel_test  <- testing(splits)

# training set proportions by children
hotel_other %>% 
  count(children) %>% 
  mutate(prop = n/sum(n))
# A tibble: 2 × 3
  children     n   prop
  <fct>    <int>  <dbl>
1 children  3046 0.0812
2 none     34454 0.919 
Code
# test set proportions by children
hotel_test  %>% 
  count(children) %>% 
  mutate(prop = n/sum(n))
# A tibble: 2 × 3
  children     n   prop
  <fct>    <int>  <dbl>
1 children   992 0.0794
2 none     11508 0.921 

Case study: hotel bookings (3/10)

Data splitting and re-sampling

The “other” set is further divided into training and validation sets. These two are used to build the model.

The test set is only used when the final model is decided.

Code
set.seed(703)
val_set <- validation_split(hotel_other, 
                            strata = children, 
                            prop = 0.80)
val_set
# Validation Set Split (0.8/0.2)  using stratification 
# A tibble: 1 × 2
  splits               id        
  <list>               <chr>     
1 <split [30000/7500]> validation

Forecasting

Resources